This is a starter RMarkdown template to accompany Data Visualization (Princeton University Press, 2019). You can use it to take notes, write your code, and produce a good-looking, reproducible document that records the work you have done. At the very top of the file is a section of metadata, or information about what the file is and what it does. The metadata is delimited by three dashes at the start and another three at the end. You should change the title, author, and date to the values that suit you. Keep the output line as it is for now, however. Each line in the metadata has a structure. First the key (“title”, “author”, etc), then a colon, and then the value associated with the key.
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both the content and the output of any embedded R code chunks within the document. A code chunk is a specially delimited section of the file. You can add one by moving the cursor to a blank line and choosing Code > Insert Chunk from the RStudio menu. When you do, an empty chunk will appear:
Code chunks are delimited by three backticks (found to the left of the 1 key on US and UK keyboards) at the start and end. The opening backticks also have a pair of braces and the letter r, to indicate what language the chunk is written in. You write your code inside the code chunks. Write your notes and other material around them, as here.
To install the tidyverse, make sure you have an Internet connection. Then manually run the code in the chunk below. If you knit the document it will be skipped. We do this because you only need to install these packages once, not every time you run this file. Either run the chunk using the little green “play” arrow to the right of the chunk area, or copy and paste the text into the console window.
## This code will not be evaluated automatically.
## (Notice the eval = FALSE declaration in the options section of the
## code chunk)
my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
"gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
"here", "interplot", "margins", "maps", "mapproj",
"mapdata", "MASS", "quantreg", "rlang", "scales",
"survey", "srvyr", "viridis", "viridisLite", "devtools")
install.packages(my_packages, repos = "http://cran.rstudio.com")
We also need to download the socviz library from GitHub.
devtools::install_github("kjhealy/socviz")
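If you would rather not reinstall packages that are already present, one possible approach is to work out which ones are missing first. This is only a sketch: the shortened package list and the name missing_packages are illustrative assumptions, and the install line is left commented out so that nothing is downloaded by accident.

```r
## A shortened, illustrative package list (the full list is in the chunk above)
my_packages <- c("tidyverse", "gapminder", "here")

## Names of all packages currently installed in this R library
installed <- rownames(installed.packages())

## Keep only the packages that are not installed yet
missing_packages <- setdiff(my_packages, installed)
missing_packages

## Uncomment to install just the missing ones:
## install.packages(missing_packages)
```

If missing_packages is empty, everything in the list is already available and there is nothing to install.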
To begin we must load some libraries we will be using. If we do not load them, R will not be able to find the functions contained in these libraries. The tidyverse includes ggplot and other tools. We also load the socviz and gapminder libraries.
Notice that here, the braces at the start of the code chunk have some additional options set in them. There is the language, r, as before. This is required. Then there is the word setup, which is a label for your code chunk. Labels are useful to briefly say what the chunk does. Label names must be unique (no two chunks in the same document can have the same label) and cannot contain spaces. Then, after the comma, an option is set: include=FALSE. This tells R to run this code but not to include the output in the final document.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
The remainder of this document contains the chapter headings for the book, and an empty code chunk in each section to get you started. Try knitting this document now by clicking the “Knit” button in the RStudio toolbar, or choosing File > Knit Document from the RStudio menu.
c() is a function, where c is short for “combine” or “concatenate”. It takes a sequence of comma-separated elements inside parentheses and joins them into a vector where each element is still individually accessible.
c(1, 2, 3, 1, 3, 5, 25)
## [1] 1 2 3 1 3 5 25
We can assign this to a variable. Use Alt + - to produce the assignment symbol <-.
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)
Type in the variable name to see the assigned object.
my_numbers
## [1] 1 2 3 1 3 5 25
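Because each element of a vector stays individually accessible, we can pull elements out by position. A minimal sketch using square brackets:

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

## Square brackets pick out elements by position (R counts from 1)
my_numbers[2]

## A vector of positions returns several elements at once
my_numbers[c(1, 7)]

## length() reports how many elements the vector holds
length(my_numbers)
```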
We can pass these numbers as an argument to functions.
mean() finds the mean of a set of numbers.
mean(my_numbers)
## [1] 5.714286
You can assign the result of a function to a variable and output it.
summary() gives some summary statistics of the numbers.
my_summary <- summary(my_numbers)
my_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 3.000 5.714 4.000 25.000
table() provides a count of each element.
table(my_numbers)
## my_numbers
## 1 2 3 5 25
## 2 1 2 1 1
If we multiply a vector by a number, each element in that vector gets multiplied by that number.
my_numbers * 5
## [1] 5 10 15 5 15 25 125
If we add a number to a vector, that number is added to each element in turn.
my_numbers + 1
## [1] 2 3 4 2 4 6 26
If we add vectors of the same length (such as adding a vector to itself), each element in one vector is added to the corresponding element in the other vector.
my_numbers + my_numbers
## [1] 2 4 6 2 6 10 50
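The same element-by-element rule applies to two different vectors of equal length. As a small sketch, here are the two vectors defined earlier added together (the name both is made up for illustration):

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)

## Element-wise addition: first + first, second + second, and so on
both <- my_numbers + your_numbers
both
```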
Every object has a class. Use the class() function to find the class of an object.
class(my_numbers)
## [1] "numeric"
class(my_summary)
## [1] "summaryDefault" "table"
class(summary)
## [1] "function"
Actions can change a class. Adding a character string to a numeric vector turns the whole object into a character vector, and the numbers will be enclosed in quotes.
my_new_vector <- c(my_numbers, "Apple")
my_new_vector
## [1] "1" "2" "3" "1" "3" "5" "25" "Apple"
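One way to confirm what has happened is to check the class and then try to convert back. This is a sketch only; the name back_to_numbers is made up for illustration.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
my_new_vector <- c(my_numbers, "Apple")

## The combined vector is now character, not numeric
class(my_new_vector)

## Converting back yields NA wherever the text is not a number
## (suppressWarnings() hides R's warning about NAs from coercion)
back_to_numbers <- suppressWarnings(as.numeric(my_new_vector))
is.na(back_to_numbers)
```

Only the last element, which came from "Apple", cannot be recovered as a number.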
The most common type of data object in R is a data frame, which is a rectangular table made up of rows (of observations) and columns (of variables).
Here is a small dataset from the socviz library:
titanic
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
class(titanic)
## [1] "data.frame"
The $ operator allows you to pick out a named column of a data frame:
titanic$percent
## [1] 62.0 5.7 16.7 15.6
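The $ operator is not the only way to pick out a column. The sketch below rebuilds the small table by hand (as titanic_copy, a made-up name, so that it runs without the socviz library) and shows two equivalent alternatives.

```r
## A hand-built copy of the small table printed above
titanic_copy <- data.frame(
  fate = c("perished", "perished", "survived", "survived"),
  sex = c("male", "female", "male", "female"),
  n = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

## Three equivalent ways to pull out the percent column as a vector
by_dollar <- titanic_copy$percent
by_name <- titanic_copy[["percent"]]
by_position <- titanic_copy[, 4]

identical(by_dollar, by_name)
identical(by_dollar, by_position)
```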
A tibble is an augmented data frame. We can convert a data frame to a tibble:
titanic.tb <- as_tibble(titanic)
titanic.tb
## # A tibble: 4 x 4
## fate sex n percent
## <fct> <fct> <dbl> <dbl>
## 1 perished male 1364 62
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
The str() function lets you see inside an object.
Objects can be simple…
str(my_numbers)
## num [1:7] 1 2 3 1 3 5 25
…or objects can be more complicated, although they are usually organized collections of simpler objects.
str(my_summary)
## 'summaryDefault' Named num [1:6] 1 1.5 3 5.71 4 ...
## - attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
In ggplot, we will build up plots a piece at a time by adding expressions to one another. When doing this, make sure your + character goes at the end of the line, like this…
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point()
…not like this:
ggplot(data = mpg, aes(x = displ, y = hwy))
+ geom_point()
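The rule behind this is that R treats a line as finished as soon as it forms a complete expression. A small sketch with ordinary arithmetic (the names total_a and total_b are made up for illustration) shows the same behaviour:

```r
## The + ends the first line, so R keeps reading onto the next line
total_a <- 1 +
  2
total_a

## Here the first line is already complete, so R stops there.
## The second line is evaluated on its own as the number +2.
total_b <- 1
+ 2
total_b
```

In the ggplot case the stranded second line usually produces an error rather than a silently wrong answer.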
Use the read_csv() function to read in comma-separated data.
Give the function a URL and it will fetch the data. A message will be printed at the console, telling us that a class has been assigned to each column of the object it has created.
url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
organs <- read_csv(file = url)
## Parsed with column specification:
## cols(
## .default = col_double(),
## country = col_character(),
## world = col_character(),
## opt = col_character(),
## consent.law = col_character(),
## consent.practice = col_character(),
## consistent = col_character(),
## ccode = col_character()
## )
## See spec(...) for full column specifications.
Let us make a scatterplot of the gapminder data.
Let’s take a look at the data first.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
We will make a scatterplot of lifeExp (life expectancy) against gdpPercap (GDP per capita).
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()
At the end of Chapter 2, we plotted a graph using ggplot. The steps are always the same.
Data
First we tell ggplot what data we are using, using the data argument:
p <- ggplot(data = gapminder)
Aesthetic mappings
Second, we tell ggplot which variables in the data should be mapped to visual elements in the plot, using the mapping argument.
It is passed the aes() function (for aesthetics).
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp))
This says that the variable on the x-axis will be gdpPercap and the variable on the y-axis will be lifeExp.
But by this point, we don’t yet have a graph…
p
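One way to see why nothing is drawn yet is to count the layers stored in the plot object. This is a sketch under assumptions: toy is a made-up data frame standing in for gapminder, and it relies on the plot object exposing its layers as p$layers.

```r
library(ggplot2)

## A tiny made-up data frame standing in for gapminder
toy <- data.frame(gdpPercap = c(800, 5000, 30000),
                  lifeExp = c(45, 65, 80))

p <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp))

## No geom has been added yet, so the plot has no layers to draw
length(p$layers)

## Adding a geom adds exactly one layer
p_points <- p + geom_point()
length(p_points$layers)
```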
Type of plot
Add a layer specifying the type of plot you want by picking a geom_ function.
We will use geom_point() to plot the x and y values as a scatterplot:
p + geom_point()
Plots are built up by adding layers one at a time. It really is an additive process.
Let’s try a different geom_ function.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This creates a smoothed line and adds a shaded ribbon showing the standard error of the line.
If we want to see the data points and the line together, we simply add geom_point() back in.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Notice in the console message that geom_smooth() is using method = gam (for generalized additive model).
We could instead use method = lm for a linear model.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "lm")
We haven’t had to tell geom_point() or geom_smooth() where to get their data from. They inherit it from the p object.
Looking at our data, it is all bunched up against the left hand side. The scale would look better if it was transformed from a linear scale to a log scale.
Add the scale_x_log10() function to p:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "gam") + scale_x_log10()
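To see why a log scale spreads out data that is bunched up at the low end, it can help to look at a few numbers directly. A minimal sketch with made-up values (gdp is illustrative, not taken from gapminder):

```r
## GDP per capita values that grow by a factor of ten each time
gdp <- c(100, 1000, 10000, 100000)

## On the raw scale the gaps between neighbours differ enormously
diff(gdp)

## On the log10 scale each tenfold step is the same distance
log10(gdp)
diff(log10(gdp))
```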
Notice the scientific notation used on the x-axis now that we are using a log scale. Let’s change to a sensible scale and use $ values (the unit of GDP per capita).
We will use the scales package’s dollar() function. Rather than loading the whole package with library(), grab the function from it directly using the syntax thepackage::thefunction.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar)
We can reformat the text under the tick marks using other labels functions.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::comma)
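The labels argument accepts a function that turns numbers into text. As a sketch, we can call the two label functions used above directly on a few made-up values to see what they return:

```r
## Label functions take numbers and return formatted text
values <- c(1000, 10000, 100000)

dollar_labels <- scales::dollar(values)
comma_labels <- scales::comma(values)

dollar_labels
comma_labels
```

Both results are character vectors, which is why they can be used as axis labels.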
An aesthetic mapping specifies that a variable will be expressed by one of the available visual elements, such as size, or colour, or shape.
We map variables to aesthetics like this:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = continent))
This code does not give a direct instruction like “colour the points purple”. Instead it says, “the property colour will represent the variable continent” or “colour will map continent”.
Let’s see what this looks like:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = continent))
p + geom_point() + scale_x_log10(labels = scales::dollar)
Different colours are used to represent points with different continent properties.
If we want to turn all the points in the figure purple, we do not do it through the mapping function. Look at what happens when we try:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = "purple"))
p + geom_point() + scale_x_log10(labels = scales::dollar)
What is going on?
mapping wants to map the property colour to a variable and assumes it will get a variable. We give it “purple”. So every row of data is assigned a categorical variable, purple, which has the value of “purple”. We have created a new column of data, every item of which is “purple”.
ggplot maps this variable to the colour aesthetic, using its default first colour of red.
The aes() function is for mapping only, not for setting a property value. If we want to set a property value, we do it in the geom_ function, outside the mapping = aes(…) step.
Let’s try this. Set geom_point()’s colour property to “purple”:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(colour = "purple") + scale_x_log10(labels = scales::dollar)
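The difference between mapping and setting can be imitated in plain R. The sketch below is a rough analogy rather than ggplot's actual internals, and toy is a made-up data frame standing in for gapminder.

```r
## A tiny made-up data frame standing in for gapminder
toy <- data.frame(gdpPercap = c(800, 5000, 30000),
                  lifeExp = c(45, 65, 80))

## Roughly what happens when "purple" is placed inside aes():
## the single string is recycled into a column with one value per row
toy$colour <- "purple"
toy

## The new column has only one distinct category
unique(toy$colour)
```

A variable with a single category gets the first colour of the default palette, which is why the points were not purple.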
We can change the look by giving different arguments to the geom_ functions. alpha sets the transparency (0 fully transparent, 1 fully opaque). se is a boolean, which turns the standard error ribbon on and off.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) +
geom_smooth(colour = "orange", se = FALSE, size = 8, method = "lm") +
scale_x_log10(labels = scales::dollar)
The labs() function controls the main labels of the plot, as well as the title, subtitle and caption.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.")
Let colour map the continent:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
colour = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
We have five smoothing lines and standard error ribbons, one for each continent. This is a consequence of the way aesthetic mappings are inherited. mapping = aes(…) is set in the call to ggplot used to create the p object. geom_point() and geom_smooth() inherit from this.
We can set the shading of this standard error ribbon to match its dominant colour, using the fill property. Whereas colour affects the appearance of lines and points, fill is for the filled areas of bars, polygons and the interiors of the smoother’s standard error ribbon.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
colour = continent, fill = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
Perhaps five separate smoothers is too many, and we just want one line. But we would still like to have the points colour-coded by continent.
By default, geoms inherit their mappings from the ggplot() function. We will map x and y in the ggplot() function as usual, which will be inherited by the geom_ functions. We will then use mapping = aes(colour = continent) only in geom_point(). This ensures that the points are colour-coded by continent, but geom_smooth() will only plot one line, as it does not map continent in any way.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(colour = continent)) +
geom_smooth(method = "loess") +
scale_x_log10()
It is possible to map continuous variables to the colour aesthetic. We can map the log of each country-year’s population, pop, to colour. (We can take the log of population right in the aes() statement using the log() function). When we do this, ggplot produces a gradient scale.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(colour = log(pop))) + scale_x_log10()
The gradient scale of the colour is continuous but is marked at intervals in the legend. Depending on the circumstances, mapping quantities like population to a continuous colour gradient may be more or less effective than cutting the variable into categorical bins.
We can set the default size of plots within our .Rmd document. This command tells R to make 8 x 5 figures:
knitr::opts_chunk$set(fig.width = 8, fig.height = 5)
We can change the size of a particular plot by placing the same options inside the curly brackets at the beginning of its chunk.
A figure can be saved to a file using the ggsave() function. To save the most recently displayed figure, provide the name we want to save it under:
ggsave(filename = "my_figure.png")
## Saving 8 x 5 in image
This will save the figure as a PNG file. If we want a PDF file instead, change the extension of the file:
ggsave(filename = "my_figure.pdf")
## Saving 8 x 5 in image
We do not need to write filename = as long as the name of the file is the first argument in ggsave(). We can also pass plot objects to ggsave(). For example, we can put our most recent plot into an object called p_out and then tell ggsave() we want to save that object.
p_out <- p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
ggsave("other_figure.pdf", plot = p_out)
## Saving 8 x 5 in image
When saving your work, it is useful to have one or more subfolders where you save only figures. You should also take care to name your saved figures in a sensible way: fig_1.pdf or my_figure.pdf are not good names.
Create a folder named figures in your project folder, then load the here library. The here() function returns the path to the project folder, in this case “C:/Users/Steven/Documents/Data Visualization - A Practical Introduction”.
here()
## [1] "C:/Users/Steven/Documents/Data Visualization - A Practical Introduction"
We can then use the here() function to make saving our work much easier. Assuming a folder named figures exists in the project folder, we can do this:
ggsave(here("figures", "here_figure.pdf"), plot = p_out)
## Saving 8 x 5 in image
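If the figures folder might not exist yet, it can be created from R before saving. The following is a sketch using base R only; tempdir() stands in for the project folder, and the names project_dir, fig_dir and fig_path are made up for illustration.

```r
## tempdir() stands in for the project folder in this sketch
project_dir <- tempdir()
fig_dir <- file.path(project_dir, "figures")

## Create the folder only if it is not there already
if (!dir.exists(fig_dir)) {
  dir.create(fig_dir, recursive = TRUE)
}

## Build the full path for a figure inside that folder
fig_path <- file.path(fig_dir, "here_figure.pdf")
dir.exists(fig_dir)
basename(fig_path)
```

In a real project the same check could be made on here("figures") instead of the temporary folder.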
In general, you should save your work in several different formats and in different sizes. You can use scale, or set the height and width (and units) explicitly.
ggsave(here("figures", "sized_figure.pdf"), plot = p_out,
height = 8, width = 10, units = "in")
What happens when you put the geom_smooth() function before geom_point()…
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth() + geom_point()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
…instead of after it?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
If geom_smooth() comes before geom_point(), then the smoothed curve is shown beneath the points and is obscured by them.
If geom_point() comes before geom_smooth(), then the smoothed curve will be plotted over the points, obscuring them.
Plots are built layer by layer, with the later layers being placed on top.
Change the mappings in the aes() function so that you plot life expectancy against population (pop) rather than per capita GDP.
p <- ggplot(data = gapminder, mapping = aes(x = pop, y = lifeExp))
p + geom_point()
Each point represents a country-year. The units of each point are taken from the data frame: lifeExp is in years and pop is the number of people.
The majority of the points are bunched on the left, the countries having relatively low populations.
There are then around 25 points further to the right, where lifeExp (starting from a low base) seems to increase with pop. This is the case of China, which starts with low life expectancy which then grows, along with its population.
Try some alternative scale mappings. Besides scale_x_log10(), you can try scale_x_sqrt() and scale_x_reverse(). There are corresponding functions for y-axis transformations. Just write y instead of x.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_x_sqrt() +
labs(title = "scale_x_sqrt()")
The points are scaled as if each x value is square rooted. The lower values appear more stretched out, and larger values are compressed.
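A few made-up numbers show the stretching and compressing effect. In this sketch the values in x are chosen so that their square roots are evenly spaced:

```r
## Values whose square roots are evenly spaced
x <- c(100, 400, 900, 1600)

## Raw gaps grow as the values get larger...
diff(x)

## ...but after taking square roots the gaps are equal
sqrt(x)
diff(sqrt(x))
```

On a square-root axis these four values would therefore sit at equal distances from one another.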
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_y_sqrt() +
labs(title = "scale_y_sqrt()")
The points are scaled as if each y value is square rooted. The lower values appear more stretched out, and larger values are compressed.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_x_reverse() +
labs(title = "scale_x_reverse()")
The x values are reversed, so values increase from right to left.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
scale_y_reverse() +
labs(title = "scale_y_reverse()")
The y values are reversed, so values increase from top to bottom.
What happens if you map colour to year instead of continent?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = year))
p + geom_point()
The year variable is a number. Higher numbers (later years) are given a lighter colour than lower numbers (earlier years), as seen in the colour scale in the legend. Countries generally seem to have life expectancy and GDP per capita increasing as the years progress.
Instead of mapping colour = year, what happens if you try colour = factor(year)?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = factor(year)))
p + geom_point()
factor() makes the year value categorical. Each year is a discrete category and gets its own colour.
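The change that factor() makes can be inspected outside of a plot. A minimal sketch with a few made-up years (years and year_f are illustrative names):

```r
## A few years as plain numbers
years <- c(1952, 1957, 1952, 2007)
class(years)

## factor() turns them into labelled categories
year_f <- factor(years)
class(year_f)
levels(year_f)
nlevels(year_f)
```

Each level becomes one entry in a discrete colour legend, rather than a position along a gradient.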
What might be a better visualization of our data, that does not ignore its temporal and country-level structure?
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(colour = country, alpha = year)) +
geom_smooth(method = "lm") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy",
subtitle = "Data points are country-years",
caption = "Source: Gapminder.") +
theme(legend.position = "none")
Each country is mapped to a colour and the year is mapped to alpha, so later years are more opaque. I have removed the legend because there are so many countries, but you would need it to see which colour represents each country.
Beginning with the gapminder dataset, imagine we wanted to see how GDP per capita changes over time. Let’s plot year and gdpPercap for each country-year:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_point()
Imagine we wanted to plot the trajectory of GDP per capita for each country by joining them with lines:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line()
This is not what we wanted. Points for each year have been joined, whereas we wanted points for each country to be joined.
Use the group aesthetic to tell ggplot explicitly about the country-level structure:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country))
The lines now show the trajectory of GDP per capita of a country over time.
We have used the group aesthetic because the grouping we wanted (country) was not built into the variables being mapped (year and gdpPercap). There is no information in the year variable itself to let ggplot know that it is grouped by country.
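The effect of grouping can be imitated with base R. In this sketch toy is a made-up data frame with two hypothetical countries, A and B; it is an analogy for what the group aesthetic does, not ggplot's own code.

```r
## Made-up numbers for two hypothetical countries, three years each
toy <- data.frame(
  country = c("A", "B", "A", "B", "A", "B"),
  year = c(1952, 1952, 1957, 1957, 1962, 1962),
  gdpPercap = c(100, 900, 120, 950, 150, 1000)
)

## Without grouping: one series ordered by year, zig-zagging between countries
ungrouped <- toy$gdpPercap[order(toy$year)]
ungrouped

## With grouping: one separate series per country
grouped <- split(toy$gdpPercap, toy$country)
grouped
```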
The previous plot is very messy. One option is to facet the data by some third variable, making a “small multiple” plot.
A separate panel is drawn for each value of the faceting variable. Facets are not a geom, but a way of organising a series of geoms. In this case, we will use facet_wrap() to split the plot by continent. Pass continent as an argument with a ~.
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) +
facet_wrap(~continent)
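The ~ in this call is R's formula notation. A short sketch shows that ~continent is an ordinary R object that simply records a variable name (f is a made-up name):

```r
## The ~ builds a formula object; nothing is evaluated yet
f <- ~continent
class(f)

## all.vars() lists the variable names the formula mentions
all.vars(f)
```

Because the formula only stores the name, continent does not need to exist as a separate object; it is looked up in the plot's data.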
We can add a smoother and some cosmetic enhancements.
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(colour = "gray70", aes(group = country)) +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels=scales::dollar) +
facet_wrap(~continent, ncol = 5) +
labs(x = "Year",
y = "GDP per capita",
title = "GDP per capita on Five Continents")
Data can be cross-classified by two categorical variables by using facet_grid(). We will use the gss_sm dataset, which is a small subset of the questions from the 2016 General Social Survey. Unlike the gapminder dataset, it contains many categorical variables.
We wish to make a smoothed scatterplot of the relationship between the age of the respondent (age) and the number of children they have (childs). We will facet this relationship by sex and race, using facet_grid(sex ~ race):
p <- ggplot(data = gss_sm,
mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(sex ~ race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
Further categorical variables can be added, such as facet_grid(sex ~ race + degree), which will have a row for each value of sex and a column for each combination of race and degree.
Whereas geom_point() just plots a point with given x and y coordinates, other geoms transform data before they are plotted (think of the smoother created by geom_smooth()).
Every geom_ function has an associated stat_ function and vice versa.
Sometimes the calculations done by the stat_ functions are not immediately obvious. Consider geom_bar():
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()
The bar height gives the count of observations from each region of the USA. There is a y-axis variable, called count, that is not in the data but has been calculated for us. Behind the scenes, geom_bar() calls its default stat_ function, stat_count(). The function computes two new variables, count and prop (short for proportion). If we want to use prop in our bar chart, it must be used as a mapping. The relevant argument is ..prop.. (we need it to begin and end with two periods so that it won’t be confused with any prop variable already in our data). Use mapping = aes(variable = ..statistic..).
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))
This is not what we want. The data is being grouped by the x-categories, whereas we want the whole data to be grouped, so each bar represents the proportion of the whole data. We do this using group = 1 inside the aes() call (1 being a “dummy group” representing the whole dataset).
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
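The calculation behind this chart can be sketched in base R. The region vector below is made up to stand in for the bigregion column, and counts and props are illustrative names.

```r
## A made-up region variable standing in for gss_sm$bigregion
region <- c("South", "South", "West", "Midwest", "South",
            "Northeast", "West", "Midwest")

## Counts per region: what geom_bar() shows by default
counts <- table(region)
counts

## Proportions of the whole sample: the idea behind group = 1
props <- prop.table(counts)
props
sum(props)
```

Recent ggplot2 releases also accept after_stat(prop) as an alternative spelling of ..prop.., should the double-period form produce a deprecation message.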
The gss_sm data contains a religion variable. Let’s graph this as a bar chart, with a colour for each religion:
p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = FALSE)
Both x and fill are mapped to religion. Because we do not need a legend showing which colour represents each religion (we can read that from the x-axis), we turn off the legend using guides(fill = FALSE).
A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables. For example, to examine the distribution of religious preferences within different regions of the United States. (Note: This is not the most straightforward way of producing these bar charts. We will see a better way in the next chapter, where we calculate a table first).
Let’s look at the breakdown of religion by region; that is, we want the religion variable broken down proportionally within bigregion. Map fill to religion:
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()
This is a stacked bar chart, where the counts of each religion are stacked within each bar. A problem with this is that each religion is unaligned and so it is hard to compare the heights.
An alternative is to set the position argument to “fill” in the geom_ function:
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")
This makes it easier to compare the proportion of each religion in a region, but we lose the relative sizes of the regions.
We could also set position = “dodge” to place the religions side by side in each region…
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge")
…but this shows counts, not proportions. We have already seen that mapping y = ..prop.. gives proportions…
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..))
…but this is not useful. The problem is we are seeing the proportion of Protestants that are Protestant (and Catholics are Catholic etc.) in each region, which is 100% in each case. Previously we fixed this using the group argument. If we set group = religion…
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = religion))
…then we see what proportion of Protestants live in each region (and Catholics etc). So we see that nearly half of all Protestants live in the South. The bars for each religion sum to one across the regions, but the bars do not sum to one in each region.
The easiest thing to do is to stop trying to force geom_bar() to do all the work in a single step. Instead, we ask ggplot to give us a proportional bar chart of religious affiliation, and then facet that by region.
p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(mapping = aes(y = ..prop.., group = bigregion)) + facet_wrap(~bigregion)
Note: in this case we group by bigregion, not religion. Otherwise we will just find that 100% of Protestants in a region are Protestant.
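The two kinds of proportion discussed here correspond to the two margins of a cross-tabulation. A sketch in base R, using a handful of made-up survey rows in place of gss_sm:

```r
## Made-up survey rows standing in for gss_sm
region <- c("South", "South", "South", "West", "West", "West")
religion <- c("Protestant", "Protestant", "Catholic",
              "Catholic", "None", "Protestant")

tab <- table(region, religion)

## Within each region (rows sum to 1): what the faceted chart shows
by_region <- prop.table(tab, margin = 1)
rowSums(by_region)

## Within each religion (columns sum to 1): what group = religion showed
by_religion <- prop.table(tab, margin = 2)
colSums(by_religion)
```

Choosing the grouping variable in ggplot amounts to choosing which of these two margins should sum to one.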
A histogram is a way of summarizing a continuous variable by chopping it up into segments or “bins” and counting how many observations are found within each bin. We have to decide how finely to bin the data.
The midwest dataset contains information on counties in several midwestern states of the United States. Counties vary in size, so we can make a histogram showing the distribution of their geographical areas (measured in square miles).
We need to divide the observations into bins. geom_histogram() will choose a bin size for us based on a rule of thumb. The histogram displays a count of observations in each bin.
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can set the number of bins using bins, or the width of each bin using binwidth:
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram(bins = 10)
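Binning itself can be sketched in base R with cut() and table(). The values in x are made up, and the break points are chosen by hand rather than by the rule of thumb geom_histogram() uses.

```r
## Made-up values standing in for county areas
x <- c(1, 2, 2, 3, 7, 8, 9, 9, 10)

## Two wide bins: (0, 5] and (5, 10]
wide_bins <- table(cut(x, breaks = c(0, 5, 10)))
wide_bins

## Five narrower bins show more detail in the same data
narrow_bins <- table(cut(x, breaks = c(0, 2, 4, 6, 8, 10)))
narrow_bins
```

The same nine observations look quite different depending on how finely they are binned, which is why the choice of bins or binwidth matters.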
As with bar charts, a new count variable is calculated and displayed.
We can display multiple histograms in one plot. Let’s subset our data so we only look at counties in two states, Ohio ("OH") and Wisconsin ("WI"), giving them different fills.
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi), mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)
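Note that geom_histogram() stacks groups by default, so in the plot above one state's bars sit on top of the other's. If the aim is to overlay the two distributions instead, one option is position = "identity". The sketch below shows both; the names `stacked` and `overlaid` are ours.

```r
library(ggplot2)

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))

# Default behaviour: position = "stack"
stacked <- p + geom_histogram(alpha = 0.4, bins = 20)

# position = "identity": both histograms start at zero and overlap,
# which is where the transparency set by alpha becomes useful
overlaid <- p + geom_histogram(alpha = 0.4, bins = 20, position = "identity")

overlaid
```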
An alternative to a histogram is to calculate a kernel density estimate of the underlying distribution. The geom_density() function will do this:
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density()
A similar plot can be drawn using geom_line(stat = "density"), which omits the lines at the sides and bottom of the area (and does not allow a fill):
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_line(stat = "density")
We can use colour and fill for geom_density() too. We could map each state to a different colour and fill, allowing us to see them in one plot.
p <- ggplot(data = midwest, mapping = aes(x = area, colour = state, fill = state))
p + geom_density(alpha = 0.3)
When using geom_bar(), we saw we could use the ..prop.. statistic for a proportional measure instead of the ..count.. statistic. We can do something similar with geom_histogram() and geom_density() using their stat_ functions.
For geom_density(), the underlying stat_density() function returns the ..density.. statistic by default. It can also return ..scaled.., which rescales the density estimate so that its maximum is 1.
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = area, fill = state, colour = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
mapping = aes(x = area, fill = state, colour = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..count..))
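To see how these statistics relate to one another, we can inspect what stat_density() computes for the layer. This is a sketch using layer_data() from ggplot2; the object name `computed` is ours.

```r
library(ggplot2)

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, colour = state))

# All of stat_density()'s computed variables are available in the
# layer data, whichever one is mapped to y
computed <- layer_data(p + geom_density(alpha = 0.3), 1)

# scaled is the density divided by its maximum, so within each
# group (state) it peaks at 1
tapply(computed$scaled, computed$group, max)

# count is the density multiplied by the number of observations (n)
all.equal(computed$count, computed$density * computed$n)
```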
Sometimes data will already have counts or proportions in them…
titanic
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
…so we do not need a stat_ function to calculate these things for us. Use stat = "identity" in the geom_ function. (We’ll also move the legend to the top of the plot.)
p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge",
stat = "identity") + theme(legend.position = "top")
geom_col() has the same effect as using geom_bar(stat = "identity").
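As a quick check of that equivalence, the sketch below re-enters the titanic table from the printed output above (so it runs without the socviz package) and compares the bar heights produced by the two geoms. The names `with_bar` and `with_col` are ours.

```r
library(ggplot2)

# The titanic table, typed in from the printed output above
titanic <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))

with_bar <- p + geom_bar(position = "dodge", stat = "identity")
with_col <- p + geom_col(position = "dodge")

# Both layers use the values in the data as the bar heights
all.equal(layer_data(with_bar, 1)$y, layer_data(with_col, 1)$y)

# The percent column is each n as a share of the total, rounded to one decimal
all.equal(round(100 * titanic$n / sum(titanic$n), 1), titanic$percent)
```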
We can also use position = "identity" to plot the values as given. This lets us plot a flow of positive and negative values in a bar chart, which is useful when looking at changes relative to some threshold level or baseline. The oecd_sum table in socviz contains information on average life expectancy at birth within the USA and other OECD countries.
oecd_sum
## Warning: Detecting old grouped_df format, replacing `vars` attribute by
## `groups`
## # A tibble: 57 x 5
## # Groups: year [57]
## year other usa diff hi_lo
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1960 68.6 69.9 1.3 Below
## 2 1961 69.2 70.4 1.2 Below
## 3 1962 68.9 70.2 1.30 Below
## 4 1963 69.1 70 0.9 Below
## 5 1964 69.5 70.3 0.800 Below
## 6 1965 69.6 70.3 0.7 Below
## 7 1966 69.9 70.3 0.400 Below
## 8 1967 70.1 70.7 0.6 Below
## 9 1968 70.1 70.4 0.3 Below
## 10 1969 70.1 70.6 0.5 Below
## # ... with 47 more rows
The other column is the average life expectancy in a given year for countries excluding the United States. The usa column is the U.S. life expectancy. diff is the difference between the two values and hi_lo indicates whether the U.S. value is higher or lower than the OECD average.
We will plot the difference over time and use the hi_lo variable to colour the columns in the chart.
p <- ggplot(data = oecd_sum, mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides(fill = FALSE) +
labs(x = NULL, y = "Difference in Years",
title = "The US Life Expectancy Gap",
subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017.")
## Warning: Removed 1 rows containing missing values (position_stack).
Revisit the gapminder plots at the beginning of the chapter and experiment with different ways to facet the data.
Try plotting population and per capita GDP while faceting on year
p <- ggplot(data = gapminder, mapping = aes(x = pop, y = gdpPercap))
p + geom_point(colour = "gray70") + geom_smooth(se = FALSE) + facet_wrap(~year)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try plotting population and per capita GDP while faceting on country
p <- ggplot(data = gapminder, mapping = aes(x = pop, y = gdpPercap))
p + geom_point(colour = "gray70") + geom_smooth(se = FALSE) + facet_wrap(~country)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Investigate the difference between a formula written as facet_grid(sex ~ race) and one written as facet_grid(~ sex + race).
If we use facet_grid(sex ~ race)…
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_grid(sex ~ race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
…the facets break out the data into sex (rows) and race (columns).
If we use facet_grid(~ sex + race)…
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_grid(~ sex + race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
…the facets break out the data for each combination of sex and race.
Experiment to see what happens when you use facet_wrap() with more complex formulas like facet_wrap(~ sex + race) instead of facet_grid().
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) + geom_smooth() + facet_wrap(~ sex + race)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
facet_wrap() breaks out the data for each combination of sex and race (as does facet_grid()) but lays out the results in a wrapped 1D table rather than a fully cross-classified grid.
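The layout difference can also be inspected directly. The sketch below uses the mpg data that ships with ggplot2 as a stand-in for gss_sm (an assumption made so the example runs with ggplot2 alone), with drv and year playing the roles of sex and race. It reads the panel layout from the built plot.

```r
library(ggplot2)

# mpg ships with ggplot2; drv and year stand in for sex and race here
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()

grid_plot <- p + facet_grid(~ drv + year)
wrap_plot <- p + facet_wrap(~ drv + year)

# The layout table records the row and column each panel occupies
grid_layout <- ggplot_build(grid_plot)$layout$layout
wrap_layout <- ggplot_build(wrap_plot)$layout$layout

max(grid_layout$ROW)  # facet_grid(~ a + b): all panels in a single row
max(wrap_layout$ROW)  # facet_wrap(): the same panels wrapped onto several rows
```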
Frequency polygons are closely related to histograms. Instead of displaying the count of observations using bars, they display it with a series of connected lines. Try the various geom_histogram() calls in this chapter using geom_freqpoly() instead.
Default geom_freqpoly()
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Define the Number of Bins
We can still define bins and binwidth…
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_freqpoly(bins = 10)
Subset the Data
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi), mapping = aes(x = percollege, colour = state))
p + geom_freqpoly(bins = 20)
A histogram bins observations for one variable and shows a bar with the count in each bin. We can do this for two variables at once, too. The geom_bin2d() function takes two mappings, x and y. It divides the plot into a grid and colours the bins by the count of observations in them. Try it on the gapminder data, plotting life expectancy against per capita GDP.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_bin2d()
The number of bins can now be set in two dimensions, e.g. bins = c(20, 50).
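As a sketch of setting the bins in both dimensions, here using the midwest data from ggplot2 as a stand-in for gapminder (an assumption made so the example needs only ggplot2). The first value applies to the x axis and the second to the y axis.

```r
library(ggplot2)

p <- ggplot(data = midwest,
            mapping = aes(x = percollege, y = percbelowpoverty))

# 20 bins along x and 50 bins along y
b <- p + geom_bin2d(bins = c(20, 50))
b

# Each county is counted in exactly one cell of the grid
sum(layer_data(b, 1)$count) == nrow(midwest)
```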
Density estimates can also be drawn in two dimensions. The geom_density_2d() function draws contour lines estimating the joint distribution of two variables.
Try it with the midwest data, plotting percent below the poverty line (percbelowpoverty) against percent college-educated (percollege).
Try it with a geom_point() layer…
p <- ggplot(data = midwest, mapping = aes(x = percollege, y = percbelowpoverty))
p + geom_density_2d() + geom_point()
…and without a geom_point() layer
p <- ggplot(data = midwest, mapping = aes(x = percollege, y = percbelowpoverty))
p + geom_density_2d()